Making Pre-trained Language Models both Task-solvers and Self-calibrators
Pre-trained language models (PLMs) serve as backbones for various real-world
systems. For high-stakes applications, it is equally essential to have
reliable confidence estimates for their predictions. While the vanilla
confidence scores of PLMs can already be utilized effectively, PLMs are
consistently overconfident in their wrong predictions, which is undesirable
in practice.
Previous work shows that introducing an extra calibration task can mitigate
this issue. The basic idea involves acquiring additional data to train models
in predicting the confidence of their initial predictions. However, that
work only demonstrates the feasibility of this kind of method, assuming
that abundant extra samples are available for the introduced calibration
task. In this work, we consider the practical scenario in which the
available training samples must be used effectively to make PLMs both
task-solvers and self-calibrators. This setting presents three challenges:
limited training samples, data imbalance,
and distribution shifts. We first conduct pilot experiments to quantify various
decisive factors in the calibration task. Based on the empirical analysis
results, we propose a training algorithm, LM-TOAST, to tackle these challenges.
Experimental results show that LM-TOAST can effectively utilize the training
data to make PLMs have reasonable confidence estimations while maintaining the
original task performance. Further, we consider three downstream applications,
namely selective classification, adversarial defense, and model cascading, to
show the practical usefulness of LM-TOAST. The code will be made public at
\url{https://github.com/Yangyi-Chen/LM-TOAST}.
Comment: Accepted to Findings of ACL 202
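The generic "extra calibration task" idea described above can be sketched in a few lines. This is an illustrative toy, not LM-TOAST itself: logistic regression stands in for the PLM, the data is synthetic, and using the decision margin as the calibrator's input feature is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the task: noisy 2-class data, with logistic regression
# playing the role of the PLM.
X = rng.normal(size=(400, 5))
w_true = rng.normal(size=5)
y = (X @ w_true + rng.normal(scale=2.0, size=400) > 0).astype(int)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

# Stage 1: solve the original task.
w_task = fit_logreg(X, y)
p_task = sigmoid(X @ w_task)
preds = (p_task > 0.5).astype(int)

# Stage 2 (the calibration task): train a second model to predict whether
# the first model's predictions are correct, here from the task model's
# decision margin plus a bias term.
correct = (preds == y).astype(int)
feats = np.column_stack([np.ones(len(y)), np.abs(p_task - 0.5)])
w_cal = fit_logreg(feats, correct)
confidence = sigmoid(feats @ w_cal)
```

The point of the second stage is that `confidence` is trained against correctness labels rather than being read off the task head, which is what mitigates overconfidence on wrong predictions; LM-TOAST additionally handles the limited-data, imbalance, and shift issues that this toy ignores.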
Bridge the Gap Between CV and NLP! An Optimization-based Textual Adversarial Attack Framework
Despite recent success on various tasks, deep learning techniques still
perform poorly on adversarial examples with small perturbations. While
optimization-based methods for adversarial attacks are well-explored in the
field of computer vision, it is impractical to directly apply them in natural
language processing due to the discrete nature of the text. To address the
problem, we propose a unified framework to extend the existing
optimization-based adversarial attack methods in the vision domain to craft
textual adversarial samples. In this framework, continuously optimized
perturbations are added to the embedding layer and amplified in the forward
propagation process. Then the final perturbed latent representations are
decoded with a masked language model head to obtain potential adversarial
samples. In this paper, we instantiate our framework with an attack algorithm
named Textual Projected Gradient Descent (T-PGD). We find our algorithm
effective even when using gradient information from a proxy model.
Therefore, we perform the more
challenging transfer black-box attack and conduct comprehensive experiments to
evaluate our attack algorithm with several models on three benchmark datasets.
Experimental results demonstrate that our method achieves an overall better
performance and produces more fluent and grammatical adversarial samples
compared to strong baseline methods. All the code and data will be made public.
Comment: Codes are available at: https://github.com/Phantivia/T-PG
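The continuous optimization loop at the core of the framework can be sketched as follows. This is a hedged toy, not the authors' implementation: the linear classifier, the single 8-dimensional "embedding", and all hyperparameters are assumptions, and the masked-language-model decoding step that maps the perturbed representation back to discrete text is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-ins: a frozen linear "classifier" over an 8-dim text embedding.
w = rng.normal(size=8)   # weights of the hypothetical victim/proxy model
x = rng.normal(size=8)   # continuous embedding of the input text
y = 1                    # true label

def loss_and_grad(x_adv):
    """Cross-entropy loss of the toy classifier and its gradient with
    respect to the continuous embedding input."""
    p = sigmoid(w @ x_adv)
    loss = -np.log(p) if y == 1 else -np.log(1.0 - p)
    grad = (p - y) * w
    return loss, grad

# PGD in the continuous embedding space: ascend the loss with normalized
# steps, then project the perturbation back onto an L2 ball of radius eps.
eps, alpha, steps = 1.0, 0.2, 20
delta = np.zeros_like(x)
for _ in range(steps):
    _, grad = loss_and_grad(x + delta)
    delta += alpha * grad / (np.linalg.norm(grad) + 1e-12)
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta *= eps / norm   # projection step of PGD

clean_loss, _ = loss_and_grad(x)
adv_loss, _ = loss_and_grad(x + delta)
```

Because the perturbation lives in the embedding space, the discreteness of text never enters the optimization; it only reappears in the (omitted) decoding step.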
M22: A Communication-Efficient Algorithm for Federated Learning Inspired by Rate-Distortion
In federated learning (FL), the communication constraint between the remote
learners and the Parameter Server (PS) is a crucial bottleneck. For this
reason, model updates must be compressed so as to minimize the loss in accuracy
resulting from the communication constraint. This paper proposes the
``\emph{$M$-magnitude weighted $L_2$ distortion + $2$ degrees of freedom}''
(M22) algorithm, a rate-distortion-inspired approach to gradient compression
for federated training of deep neural networks (DNNs). In particular, we
propose a family of distortion measures between the original gradient and the
reconstruction, which we refer to as the ``$M$-magnitude weighted $L_2$''
distortion, and we assume that gradient updates follow an i.i.d.
distribution -- generalized normal or Weibull -- each of which has two
degrees of freedom. Both the distortion measure and the gradient
distribution have one free parameter that can be fitted as a function of
the iteration number. Given a choice of gradient
distribution and distortion measure, we design the quantizer minimizing the
expected distortion in gradient reconstruction. To measure the gradient
compression performance under a communication constraint, we define the
\emph{per-bit accuracy} as the optimal improvement in accuracy that one bit of
communication brings to the centralized model over the training period. Using
this performance measure, we systematically benchmark the choice of gradient
distribution and distortion measure. We provide substantial insights on the
role of these choices and argue that significant performance improvements can
be attained using such a rate-distortion-inspired compressor.
Comment: arXiv admin note: text overlap with arXiv:2202.0281
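The quantizer-design step can be illustrated with a small numerical sketch. All specifics here are assumptions for illustration, not the paper's method: Laplace-distributed stand-in gradients (a generalized normal with shape parameter 1), 8 quantization levels, weighting exponent a = 1, and a Lloyd-style iteration on empirical samples rather than an analytical design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in gradient samples from a Laplace distribution.
g = rng.laplace(size=20000)

def lloyd_quantizer(samples, n_levels=8, a=1.0, iters=50):
    """Lloyd-style quantizer minimizing the empirical magnitude-weighted
    distortion E[|g|^a (g - q(g))^2]; `a` is the free weighting exponent.
    Because the weight |g|^a depends only on the sample, nearest-level
    assignment remains optimal; only the centroid update changes."""
    w = np.abs(samples) ** a
    # Initialize levels at evenly spaced quantiles of the sample.
    levels = np.quantile(samples, np.linspace(0.05, 0.95, n_levels))
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - levels[None, :]), axis=1)
        for k in range(n_levels):
            mask = idx == k
            if mask.any():
                # Weighted-mean centroid minimizes sum of w * (g - c)^2.
                levels[k] = np.average(samples[mask], weights=w[mask] + 1e-12)
        levels.sort()
    return levels

levels = lloyd_quantizer(g)
q = levels[np.argmin(np.abs(g[:, None] - levels[None, :]), axis=1)]
weighted_mse = np.mean(np.abs(g) * (g - q) ** 2)

# Baseline for comparison: uniform levels over the sample range.
uniform = np.linspace(g.min(), g.max(), 8)
qu = uniform[np.argmin(np.abs(g[:, None] - uniform[None, :]), axis=1)]
uniform_mse = np.mean(np.abs(g) * (g - qu) ** 2)
```

Matching the quantizer to the assumed distribution and distortion measure beats a naive uniform quantizer at the same bit budget, which is the rate-distortion intuition behind M22.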
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Vision-language models (VLMs) have recently demonstrated strong efficacy as
visual assistants that can parse natural queries about the visual content and
generate human-like outputs. In this work, we explore the ability of these
models to demonstrate human-like reasoning based on the perceived information.
To address a crucial concern regarding the extent to which their reasoning
capabilities are fully consistent and grounded, we also measure the reasoning
consistency of these models. We achieve this by proposing a chain-of-thought
(CoT) based consistency measure. However, such an evaluation requires a
benchmark that encompasses both high-level inference and detailed reasoning
chains, which is costly. We tackle this challenge by proposing a
LLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously
ensuring the generation of a high-quality dataset. Based on this pipeline and
the existing coarse-grained annotated dataset, we build the CURE benchmark to
measure both the zero-shot reasoning performance and consistency of VLMs. We
evaluate existing state-of-the-art VLMs, and find that even the best-performing
model is unable to demonstrate strong visual reasoning capabilities and
consistency, indicating that substantial efforts are required to enable VLMs to
perform visual reasoning as systematically and consistently as humans. As an
early step, we propose a two-stage training framework aimed at improving both
the reasoning performance and consistency of VLMs. The first stage involves
employing supervised fine-tuning of VLMs using step-by-step reasoning samples
automatically generated by LLMs. In the second stage, we further augment the
training process by incorporating feedback provided by LLMs to produce
reasoning chains that are highly consistent and grounded. We empirically
highlight the effectiveness of our framework in both reasoning performance and
consistency.
Comment: The data is released at
\url{https://github.com/Yangyi-Chen/CoTConsistency}
A Data-Centric Solution to NonHomogeneous Dehazing via Vision Transformer
Recent years have witnessed an increased interest in image dehazing. Many
deep learning methods have been proposed to tackle this challenge, and have
made significant accomplishments dealing with homogeneous haze. However, these
solutions cannot maintain comparable performance when they are applied to
images with non-homogeneous haze, e.g., the NH-HAZE23 dataset introduced by the NTIRE
challenges. One of the reasons for such failures is that non-homogeneous haze
does not obey one of the assumptions that is required for modeling homogeneous
haze. In addition, traditional end-to-end training approaches require a
large number of pairs of non-homogeneous hazy images and their clean
counterparts, while the NH-HAZE23 dataset is of limited size. Although it is
possible to augment the NH-HAZE23 dataset by leveraging other non-homogeneous
dehazing datasets, we observe that it is necessary to design a proper
data-preprocessing approach that reduces the distribution gaps between the
target dataset and the augmented one. This finding indeed aligns with the
essence of data-centric AI. With a novel network architecture and a principled
data-preprocessing approach that systematically enhances data quality, we
present an innovative dehazing method. Specifically, we apply RGB-channel-wise
transformations on the augmented datasets, and incorporate the state-of-the-art
transformers as the backbone in the two-branch framework. We conduct extensive
experiments and ablation studies to demonstrate the effectiveness of our
proposed method.
Comment: Accepted by CVPRW 202
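One simple instance of an RGB-channel-wise transformation for reducing the distribution gap is per-channel mean/std matching against the target dataset. The sketch below is an illustrative assumption, not necessarily the exact transformation used in the paper; the synthetic "images" stand in for the augmented and target datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: an "augmented" hazy image drawn from a different colour
# distribution than the "target" dataset (shape H x W x 3, values in [0, 1]).
target = np.clip(rng.normal(loc=[0.5, 0.55, 0.6], scale=0.1, size=(64, 64, 3)), 0, 1)
augmented = np.clip(rng.normal(loc=[0.3, 0.4, 0.7], scale=0.2, size=(64, 64, 3)), 0, 1)

def match_channel_stats(src, ref):
    """Per-RGB-channel mean/std matching: shift and scale each channel of
    `src` so its first two moments match those of `ref`."""
    out = np.empty_like(src)
    for c in range(3):
        s_mu, s_sd = src[..., c].mean(), src[..., c].std()
        r_mu, r_sd = ref[..., c].mean(), ref[..., c].std()
        out[..., c] = (src[..., c] - s_mu) / (s_sd + 1e-8) * r_sd + r_mu
    return np.clip(out, 0.0, 1.0)

aligned = match_channel_stats(augmented, target)
gap_before = np.abs(augmented.mean(axis=(0, 1)) - target.mean(axis=(0, 1))).sum()
gap_after = np.abs(aligned.mean(axis=(0, 1)) - target.mean(axis=(0, 1))).sum()
```

After the transform, the augmented data's per-channel statistics sit close to the target's, which is the sense in which the preprocessing "reduces the distribution gap" before end-to-end training.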
The state of water and fat during the maturation of Cheddar cheese
Cheddar cheese predicted to develop into different quality classes has been evaluated by time-domain Nuclear Magnetic Resonance, thermogravimetric analysis, and quantitative sensory analysis. The water and fat proton signals in the transverse relaxation decay curves have been deconvoluted. Proton transverse relaxation values for both the water and fat fractions decrease, and the relative percentage of the proton peak area, predominantly from the fat, increases over a 450-day ripening period. The thermodynamic free-water percentage increases during maturation. Water and fat attributes can distinguish between Cheddar cheese batches after 56 days. Cheese batches which have lower transverse relaxation values for the water and fat proton fractions, and a higher relative percentage of the proton peak area predominantly from fat, at 56 days mature after 270 days to be more yellow, rubbery and smooth, have a less sour and lingering aftertaste, and are also harder to form into a cheese ball.
Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
This paper reexamines the research on out-of-distribution (OOD) robustness in
the field of NLP. We find that the distribution shift settings in previous
studies commonly lack adequate challenges, hindering the accurate evaluation of
OOD robustness. To address these issues, we propose a benchmark construction
protocol that ensures clear differentiation and challenging distribution
shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution
robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we
conduct a series of experiments on pre-trained language models for analysis and
evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the
relationship between in-distribution (ID) and OOD performance. We identify
three typical types of ID-OOD relationship, which unveil the models' inner
learning mechanisms and could potentially facilitate forecasting OOD
robustness from advancements on ID datasets. Then, we evaluate 5 classic
methods on BOSS and
find that, despite exhibiting some effectiveness in specific cases, they do not
offer significant improvement compared to vanilla fine-tuning. Further, we
evaluate 5 LLMs with various adaptation paradigms and find that, when
sufficient ID data is available, fine-tuned domain-specific models
significantly outperform LLMs on ID examples. However, on OOD instances,
prioritizing
LLMs with in-context learning yields better results. We identify that both
fine-tuned small models and LLMs face challenges in effectively addressing
downstream tasks. The code is public at
\url{https://github.com/lifan-yuan/OOD_NLP}.
Comment: Accepted to NeurIPS 2023 Datasets and Benchmarks Track.
Selection of potential molecular markers for cheese ripening and quality prediction by NMR spectroscopy
Predicting cheese quality as early as possible after ripening is important for quality control in the cheese industry. The main aim of this study was to investigate potential metabolites for predictive models of Cheddar cheese quality. Metabolites in aqueous extracts of Cheddar cheese were identified by Nuclear Magnetic Resonance, and their kinetics were measured over up to 450 days of ripening. The proton ratios of citrulline and arginine relative to the overall proton content of the aqueous extract are the most important indices for assessing the ripening of Cheddar cheese; these ratios decrease by 59% and 69%, respectively, after 450 days of ripening. In comparison to the premium batch B cheese, batch C, which was predicted to attain a lower quality level, had higher serine and β-galactose levels, lower lactic acid levels, and a less mature sensorial profile. Tyrosine, tyramine and lysine are highly correlated with mature Cheddar cheese sensory attributes, whereas β-galactose and glycerol are correlated with young Cheddar cheese sensory attributes. These metabolites can be used to predict cheese quality.
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework
Textual adversarial attacks can discover models' weaknesses by adding
semantic-preserved but misleading perturbations to the inputs. The long-lasting
adversarial attack-and-defense arms race in Natural Language Processing (NLP)
is algorithm-centric, providing valuable techniques for automatic robustness
evaluation. However, existing practice in robustness evaluation may suffer
from incomprehensive evaluation, impractical evaluation protocols, and
invalid adversarial samples. In this paper, we aim to set up a unified
automatic robustness evaluation framework, shifting towards model-centric
evaluation to further exploit the advantages of adversarial attacks. To address
the above challenges, we first determine robustness evaluation dimensions based
on model capabilities and specify a reasonable algorithm to generate
adversarial samples for each dimension. Then we establish the evaluation
protocol, including evaluation settings and metrics, under realistic demands.
Finally, we use the perturbation degree of adversarial samples to control the
sample validity. We implement a toolkit, RobTest, that realizes our automatic
robustness evaluation framework. In our experiments, we conduct a robustness
evaluation of RoBERTa models to demonstrate the effectiveness of our evaluation
framework, and further show the rationality of each component in the framework.
The code will be made public at \url{https://github.com/thunlp/RobTest}.
Comment: Accepted to Findings of ACL 202
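As an illustration of controlling sample validity via perturbation degree, the sketch below measures a word-level modification rate and rejects adversarial candidates above a budget. The function names, the 0.15 threshold, and the equal-length word-substitution assumption are all hypothetical, not RobTest's actual interface.

```python
def perturbation_degree(original: str, adversarial: str) -> float:
    """Fraction of word positions that differ between the original and the
    adversarial candidate (assumes word-substitution attacks that preserve
    length; a simplification of degree-based validity control)."""
    orig_words = original.split()
    adv_words = adversarial.split()
    if len(orig_words) != len(adv_words):
        return 1.0  # treat insertions/deletions as maximally perturbed here
    diffs = sum(o != a for o, a in zip(orig_words, adv_words))
    return diffs / max(len(orig_words), 1)

def filter_valid(pairs, max_degree=0.15):
    """Keep only candidates within the perturbation budget."""
    return [(o, a) for o, a in pairs if perturbation_degree(o, a) <= max_degree]

pairs = [
    ("the movie was great fun for everyone involved here",   # 1/9 words changed
     "the movie was terrific fun for everyone involved here"),
    ("the movie was great fun for everyone involved here",   # 5/9 words changed
     "a film is terrific entertainment for everyone involved here"),
]
valid = filter_valid(pairs)
```

Capping the perturbation degree rules out candidates that differ so much from the original that they no longer preserve its meaning, which is the validity concern the framework addresses.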